Objective: + Visualization and intuition of Principal Component Analysis. + Comprehending clustering as an unsupervised learning algorithm. + Clustering using k-means, hierarchical, and density-based methodologies. + Creating cluster trees, heatmaps, and silhouettes. + Segmenting data by a similarity criterion using distance metrics.
PCA: Principal Component Analysis
+The total number of dimensions in a dataset is the number of variables it contains. For example, if our data has 35 features, then the data comprises 35 dimensions, and it is difficult to decipher the substantive patterns within such a multivariate dataset.
+To offset this issue, PCA creates new variables called components. Each component is a linear combination of all the original variables, which in our scenario are the 35 features.
+New variables:
Variable 1: Component 1
\[ y_1=\beta_{11}x_1+\beta_{12}x_2+\beta_{13}x_3+\dots+\beta_{1,35}x_{35}\]
Variable 2: Component 2
\[ y_2=\beta_{21}x_1+\beta_{22}x_2+\beta_{23}x_3+\dots+\beta_{2,35}x_{35}\]
+Mathematically, this new, smaller set of variables, named principal components, extracts credible and substantive information from the original large set of variables by numerically and visually quantifying the maximal variability of the data. Principal components are linear combinations of the original variables, each orthogonal to the preceding components, and together they capture 100% of the variability within the data in decreasing order of precedence. +The first principal component captures the largest variability, depicted as a direction along the PC1 axis, whereas the second principal component captures the second-largest variability along the PC2 axis.
+Numerically, within the PCA framework, variability is measured by eigenvalue decomposition of the covariance matrix or by singular value decomposition of the data matrix, after initial normalization of the data.
+PCA can also be used in conjunction with other unsupervised learning techniques, such as k-means clustering or hierarchical clustering, to showcase inherent data structures.
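As a minimal sketch of the decomposition described above, using base R's prcomp() on a hypothetical random 35-feature matrix (the data here are an assumption purely for illustration):

```r
# Hypothetical data: 100 observations of 35 features
set.seed(42)
X <- matrix(rnorm(100 * 35), nrow = 100, ncol = 35)
# Center and scale, then decompose; prcomp() uses SVD internally
pca <- prcomp(X, center = TRUE, scale. = TRUE)
# Proportion of variance captured by the leading components, in decreasing order
summary(pca)$importance["Proportion of Variance", 1:5]
# The first two component scores can be used for a 2-D visualization
head(pca$x[, 1:2])
```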
Each feature is a dimension; therefore a 10-feature space implies that the data lie in a 10-dimensional space.
If k=3 then initially the algorithm randomly assigns three centers to three clusters.
K-means is very sensitive to the starting centers, so optimal centers are sometimes not obtained.
Many solutions have been identified to offset this issue; for instance, the initial centers need not be values drawn from the data examples themselves. K-means++ improves performance through a seeding algorithm that makes a near-optimal result more likely.
Once the cluster centers are determined, the examples are assigned to the appropriate cluster depending on their distance from each of the centers. The most commonly used distance metric is the Euclidean distance.
\[dist(x,y)=\sqrt{\sum_{i=1}^{n}{(x_i-y_i)}^2}\]
The Euclidean distance formula is used to compute the distance between each example and each center. Each example is assigned to the center which has the smallest distance from it.
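The assignment step above can be sketched in base R (a toy example with made-up 2-D points and k = 3 centers sampled from the data):

```r
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)            # 10 toy examples in 2-D
centers <- X[sample(nrow(X), 3), ]          # k = 3 initial centers drawn from the data
# Euclidean distance from every example to every center
d <- as.matrix(dist(rbind(centers, X)))[-(1:3), 1:3]
assignment <- apply(d, 1, which.min)        # each example joins its nearest center
assignment
```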
The three clusters formed are Cluster A, Cluster B, and Cluster C. The cluster boundaries form a Voronoi diagram; the vertex where the three boundaries meet is mathematically equidistant from the three centers.
After the initial phase, each center is updated to the centroid of the examples currently assigned to its cluster.
The distances are then recalculated to reassign all the examples to the appropriate cluster.
The process works iteratively: centers and examples are reassigned until the assignments stabilize and no longer change.
Finally, the cluster centers are determined and the examples are fully segmented.
K-means is sensitive to the number of clusters: if k is too large, the clusters become more homogeneous, but the data can be overfitted.
Often, a priori knowledge can facilitate the decision on the number of clusters. For example, we might decide on the number of clusters on the basis of movie genres for a movies dataset.
The number of clusters k could also be determined by business decisions.
A widely used technique called the elbow method helps determine the size of k based on the criterion of increasing within-cluster homogeneity and decreasing heterogeneity.
The objective is to determine the balance point beyond which the homogeneity or heterogeneity no longer changes appreciably.
Numerous statistical techniques are used to create the elbow graph and provide insights into the data.
To obtain the best-partitioned clusters, the within-cluster sum of squares (WCSS) has to be minimized.
\[WCSS=\sum_{k=1}^{K}\sum_{x_i\in C_k}(x_i-\mu_k)^2\]
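A hand-rolled elbow sketch that minimizes this quantity over a range of k (assuming a scaled numeric matrix; kmeans() reports the WCSS as tot.withinss):

```r
set.seed(7)
X <- scale(matrix(rnorm(200), ncol = 2))   # toy scaled data, 100 points in 2-D
# Total within-cluster sum of squares for k = 1..8
wcss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wcss, type = "b", xlab = "k", ylab = "WCSS")  # look for the 'elbow'
```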
1. **Manhattan Distance**
\[d_m(x,y)=\sum_{i=1}^{n}|x_i-y_i|\]
2. **Pearson Correlation Distance**
\[d_c=1-\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]
3. **Cosine Distance**
\[d_e=1-\frac{\left|\sum_{i=1}^{n}x_iy_i\right|}{\sqrt{\sum_{i=1}^{n}x_i^2\sum_{i=1}^{n}y_i^2}}\]
4. **Spearman's Correlation Distance** (the x and y values are ranks)
\[d_s=1-\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]
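The four distances above, computed for two toy vectors (Spearman's distance is obtained via cor(..., method = "spearman"), which is equivalent to Pearson on the ranks; the others are direct transcriptions of the formulas):

```r
x <- c(1, 3, 5, 7); y <- c(2, 2, 6, 9)
d_manhattan <- sum(abs(x - y))
d_pearson   <- 1 - cor(x, y, method = "pearson")
d_cosine    <- 1 - abs(sum(x * y)) / sqrt(sum(x^2) * sum(y^2))
d_spearman  <- 1 - cor(x, y, method = "spearman")
c(manhattan = d_manhattan, pearson = d_pearson,
  cosine = d_cosine, spearman = d_spearman)
```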
A clustering technique that overcomes the weaknesses of k-means is PAM (Partitioning Around Medoids); additionally, CLARA (Clustering LARge Applications) is used for bigger datasets. PAM uses an actual data point of the original dataset as the representative of each cluster. This is not the mean of the cluster, since the mean is sensitive to outliers. As in the k-means algorithm, a dissimilarity matrix is calculated using diverse distance measures. The method requires the number of clusters k to be specified, and this can be ascertained by the silhouette method.
The initially assigned medoids are checked against the dissimilarity metric, and the data points are reassigned to obtain better medoids until convergence occurs.
To obtain optimal clusters, statistical methods such as the elbow, silhouette, and gap statistic are used; the gap statistic is a formal statistical method and so is often the most reliable.
library(readxl)
library(ggplot2)
library(dplyr)
library(corrplot)
library(caret)
library(cluster)
library(factoextra)
library(magrittr)
library(fpc)
setwd('C:\\projects\\MachineLearning')
heartdata_initial<-read_excel("heart.xlsx")
heartdata_initial$Sex<-factor(heartdata_initial$Sex)
heartdata_initial$ChestPainType<-as.factor(heartdata_initial$ChestPainType)
heartdata_initial$FastingBPmorethan120<-as.factor(heartdata_initial$FastingBPmorethan120)
heartdata_initial$ExerciseInducedAngina<-as.factor(heartdata_initial$ExerciseInducedAngina)
heartdata_initial$Thal<-as.factor(heartdata_initial$Thal)
#heartdata_initial$ColoredVessels<-as.factor(heartdata_initial$ColoredVessels)
heartdata_initial$HeartAttackRisk<-as.factor(heartdata_initial$HeartAttackRisk)
heartdata_initial$RestingElectrographicResults<-as.factor(heartdata_initial$RestingElectrographicResults)
summary(heartdata_initial)
## Age Sex ChestPainType RestingBloodPressure Cholesterol
## Min. :29.00 0: 96 0:143 Min. : 94.0 Min. :126.0
## 1st Qu.:47.50 1:207 1: 50 1st Qu.:120.0 1st Qu.:211.0
## Median :55.00 2: 87 Median :130.0 Median :240.0
## Mean :54.37 3: 23 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :200.0 Max. :564.0
## FastingBPmorethan120 RestingElectrographicResults MaximumHeartRate
## 0:258 0:147 Min. : 71.0
## 1: 45 1:152 1st Qu.:133.5
## 2: 4 Median :153.0
## Mean :149.6
## 3rd Qu.:166.0
## Max. :202.0
## ExerciseInducedAngina Oldpeak Slope ColoredVessels Thal
## 0:204 Min. :0.00 Min. :0.000 Min. :0.0000 0: 2
## 1: 99 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000 1: 18
## Median :0.80 Median :1.000 Median :0.0000 2:166
## Mean :1.04 Mean :1.399 Mean :0.7294 3:117
## 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :6.20 Max. :2.000 Max. :4.0000
## HeartAttackRisk
## 0:138
## 1:165
##
##
##
##
heartdata<-heartdata_initial[c(1,4,5,8,10,12)]
heartdata_scaled<-scale(heartdata)
# Looking at the optimal clusters by elbow method
optimalclusters<-fviz_nbclust(heartdata_scaled,kmeans,method="wss")
print(optimalclusters)
# Running the k-means clustering algorithm.
kmeansdf<-kmeans(heartdata_scaled,2,nstart=25)
print(kmeansdf)
## K-means clustering with 2 clusters of sizes 139, 164
##
## Cluster means:
## Age RestingBloodPressure Cholesterol MaximumHeartRate Oldpeak
## 1 0.6923882 0.3775110 0.2632115 -0.6121313 0.5377989
## 2 -0.5868412 -0.3199636 -0.2230878 0.5188186 -0.4558173
## ColoredVessels
## 1 0.4827357
## 2 -0.4091480
##
## Clustering vector:
## [1] 1 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 1 2 1 2 2 2 1 2 1 1 2 1 2 2 2 2 2 1 2 2
## [38] 2 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 1 1 2 2 2 2 1 2 2 2 1 1 2 2 2 1
## [112] 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 1 2 2 2 1 1 1 2 2
## [149] 2 2 1 1 1 1 2 2 2 2 1 2 2 2 2 2 2 1 1 1 1 1 1 2 2 1 1 2 1 1 2 1 1 1 2 1 1
## [186] 2 1 1 2 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 2 1 2 1 1 1 1 1 1 2 1 1
## [223] 1 1 1 1 1 2 1 1 2 1 1 1 1 2 2 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1
## [260] 2 1 2 1 1 2 1 1 1 1 1 2 1 1 2 2 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 2 1 1 1 2 1
## [297] 2 1 1 2 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 830.6576 573.4849
## (between_SS / total_SS = 22.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Finding the optimal number of clusters by Gap method
fviz_nbclust(heartdata_scaled,kmeans,nstart=25,method="gap_stat",nboot=50)+labs(subtitle = "Gap Statistic Method")
# Numerical summaries of the clusters
cluster_aggregate<-aggregate(heartdata,by=list(cluster=kmeansdf$cluster),mean)
print(cluster_aggregate)
## cluster Age RestingBloodPressure Cholesterol MaximumHeartRate Oldpeak
## 1 1 60.65468 138.2446 259.9065 135.6259 1.6640288
## 2 2 49.03659 126.0122 234.7012 161.5305 0.5103659
## ColoredVessels
## 1 1.2230216
## 2 0.3109756
# Compare the above summaries to the dataset grouped by Heart Attack Risk
tapply(heartdata_initial$Age, heartdata_initial$HeartAttackRisk, summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.0 52.0 58.0 56.6 62.0 77.0
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.0 44.0 52.0 52.5 59.0 76.0
tapply(heartdata_initial$Cholesterol, heartdata_initial$HeartAttackRisk, summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 131.0 217.2 249.0 251.1 283.0 409.0
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 126.0 208.0 234.0 242.2 267.0 564.0
#PAM Partitioning Around Medoids
# Finding the optimal number of clusters by the silhouette method and describing the cluster characteristics:
fviz_nbclust(heartdata_scaled,pam,method="silhouette")+theme_classic()
pam_clusters<-pam(heartdata_scaled,2)
print(pam_clusters)
## Medoids:
## ID Age RestingBloodPressure Cholesterol MaximumHeartRate Oldpeak
## [1,] 187 0.620304 -0.09258463 0.1299609 -0.2465324 0.3103986
## [2,] 149 -1.141403 -0.66277043 -0.3909653 0.8449247 -0.8953805
## ColoredVessels
## [1,] 0.2646444
## [2,] -0.7132490
## Clustering vector:
## [1] 1 2 2 2 1 1 1 2 1 1 1 2 2 1 1 2 1 1 2 1 1 2 2 1 2 1 1 2 1 1 2 1 2 1 1 2 2
## [38] 1 1 1 1 2 2 1 2 2 2 2 1 2 2 1 1 2 1 1 2 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2 2 2
## [75] 2 1 1 1 2 1 2 2 1 1 2 1 1 2 1 1 2 2 1 1 2 1 1 1 1 1 2 1 1 2 2 1 1 2 2 2 1
## [112] 2 1 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 1 1 1 1 2 2 2 1 1 1 2 1
## [149] 2 2 1 1 1 1 2 1 2 2 1 2 2 1 2 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1
## [186] 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 2 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1
## [297] 1 1 1 2 1 1 1
## Objective function:
## build swap
## 2.099213 2.064042
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
pamcluster_aggregate<-aggregate(heartdata,by=list(cluster=pam_clusters$cluster),mean)
pam_clusters$medoids
## Age RestingBloodPressure Cholesterol MaximumHeartRate Oldpeak
## [1,] 0.620304 -0.09258463 0.1299609 -0.2465324 0.3103986
## [2,] -1.141403 -0.66277043 -0.3909653 0.8449247 -0.8953805
## ColoredVessels
## [1,] 0.2646444
## [2,] -0.7132490
head(pam_clusters$clustering)
## [1] 1 2 2 2 1 1
# To add the cluster to original data for any further explorations.
clusterbind_heartdata<-cbind(heartdata_initial,pam_clusters$cluster)
head(clusterbind_heartdata)
## Age Sex ChestPainType RestingBloodPressure Cholesterol FastingBPmorethan120
## 1 63 1 3 145 233 1
## 2 37 1 2 130 250 0
## 3 41 0 1 130 204 0
## 4 56 1 1 120 236 0
## 5 57 0 0 120 354 0
## 6 57 1 0 140 192 0
## RestingElectrographicResults MaximumHeartRate ExerciseInducedAngina Oldpeak
## 1 0 150 0 2.3
## 2 1 187 0 3.5
## 3 0 172 0 1.4
## 4 1 178 0 0.8
## 5 1 163 1 0.6
## 6 1 148 0 0.4
## Slope ColoredVessels Thal HeartAttackRisk pam_clusters$cluster
## 1 0 0 1 1 1
## 2 0 0 2 1 2
## 3 2 0 2 1 2
## 4 2 0 2 1 2
## 5 2 0 2 1 1
## 6 1 0 1 1 1
fviz_cluster(pam_clusters,ellipse.type = "t",ggtheme=theme_classic())
This technique of partitioning the data into groups does not require prespecifying the number of clusters.
This technique is used for gene expression data analysis.
Two subtypes of hierarchical clustering are as follows:
**Agglomerative Clustering (AGNES, Agglomerative Nesting)** + This clustering starts with every instance as its own cluster and recursively merges the two most similar clusters until all instances are grouped at the root. This algorithm has a "bottom-up" paradigm.
**Divisive Clustering (DIANA, Divisive Analysis)** + This clustering starts from the root and recursively subdivides clusters according to their heterogeneity until every instance is a separate cluster. This algorithm has a "top-down" paradigm.
Hierarchical clustering generates a tree based object representation called the Dendrogram.
To achieve resultant optimal partition of data into clusters the hierarchical tree has to be cut at a certain level.
The main linkage functions are complete, single, average, centroid, and Ward's linkage.
A good visual and numerical illustration is provided at the following link:
https://en.wikipedia.org/wiki/Complete-linkage_clustering#Distance_Matrix1
library(readxl)
library(ggplot2)
library(dplyr)
library(corrplot)
library(caret)
library(cluster)
library(factoextra)
library(magrittr)
library(fpc)
# Setting the working directory
setwd('C:\\projects\\MachineLearning')
# Reading the data
heartdata_initial<-read_excel("heart.xlsx")
heartdata_initial$Sex<-factor(heartdata_initial$Sex)
heartdata_initial$ChestPainType<-as.factor(heartdata_initial$ChestPainType)
heartdata_initial$FastingBPmorethan120<-as.factor(heartdata_initial$FastingBPmorethan120)
heartdata_initial$ExerciseInducedAngina<-as.factor(heartdata_initial$ExerciseInducedAngina)
heartdata_initial$Thal<-as.factor(heartdata_initial$Thal)
#heartdata_initial$ColoredVessels<-as.factor(heartdata_initial$ColoredVessels)
heartdata_initial$HeartAttackRisk<-as.factor(heartdata_initial$HeartAttackRisk)
heartdata_initial$RestingElectrographicResults<-as.factor(heartdata_initial$RestingElectrographicResults)
# Summarizing the data
summary(heartdata_initial)
## Age Sex ChestPainType RestingBloodPressure Cholesterol
## Min. :29.00 0: 96 0:143 Min. : 94.0 Min. :126.0
## 1st Qu.:47.50 1:207 1: 50 1st Qu.:120.0 1st Qu.:211.0
## Median :55.00 2: 87 Median :130.0 Median :240.0
## Mean :54.37 3: 23 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :200.0 Max. :564.0
## FastingBPmorethan120 RestingElectrographicResults MaximumHeartRate
## 0:258 0:147 Min. : 71.0
## 1: 45 1:152 1st Qu.:133.5
## 2: 4 Median :153.0
## Mean :149.6
## 3rd Qu.:166.0
## Max. :202.0
## ExerciseInducedAngina Oldpeak Slope ColoredVessels Thal
## 0:204 Min. :0.00 Min. :0.000 Min. :0.0000 0: 2
## 1: 99 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000 1: 18
## Median :0.80 Median :1.000 Median :0.0000 2:166
## Mean :1.04 Mean :1.399 Mean :0.7294 3:117
## 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :6.20 Max. :2.000 Max. :4.0000
## HeartAttackRisk
## 0:138
## 1:165
##
##
##
##
# Scaling the data
heartdata<-heartdata_initial[c(1,4,5,8)]
summary(heartdata)
## Age RestingBloodPressure Cholesterol MaximumHeartRate
## Min. :29.00 Min. : 94.0 Min. :126.0 Min. : 71.0
## 1st Qu.:47.50 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:133.5
## Median :55.00 Median :130.0 Median :240.0 Median :153.0
## Mean :54.37 Mean :131.6 Mean :246.3 Mean :149.6
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5 3rd Qu.:166.0
## Max. :77.00 Max. :200.0 Max. :564.0 Max. :202.0
heartdata_scaled<-scale(heartdata)
summary(heartdata_scaled)
## Age RestingBloodPressure Cholesterol MaximumHeartRate
## Min. :-2.79300 Min. :-2.14525 Min. :-2.3203 Min. :-3.4336
## 1st Qu.:-0.75603 1st Qu.:-0.66277 1st Qu.:-0.6804 1st Qu.:-0.7049
## Median : 0.06977 Median :-0.09259 Median :-0.1209 Median : 0.1464
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.73041 3rd Qu.: 0.47760 3rd Qu.: 0.5448 3rd Qu.: 0.7139
## Max. : 2.49212 Max. : 3.89872 Max. : 6.1303 Max. : 2.2856
# Creating the dissimilarity matrix using Euclidean distance.
heartdata_dist<-dist(heartdata_scaled,method="euclidean")
# Displaying the distances
head(heartdata_dist)
## [1] 3.412320 2.799080 2.030687 2.870628 1.072887 1.441386
as.matrix(heartdata_dist)[1:7,1:7]
## 1 2 3 4 5 6 7
## 1 0.000000 3.412320 2.799080 2.030687 2.870628 1.072887 1.441386
## 2 3.412320 0.000000 1.187644 2.220144 3.209136 3.053821 2.761474
## 3 2.799080 1.187644 0.000000 1.871547 3.458114 2.140142 2.599257
## 4 2.030687 2.220144 1.871547 0.000000 2.371514 1.936149 1.934928
## 5 2.870628 3.209136 3.458114 2.371514 0.000000 3.390932 1.686193
## 6 1.072887 3.053821 2.140142 1.936149 3.390932 0.000000 1.983073
## 7 1.441386 2.761474 2.599257 1.934928 1.686193 1.983073 0.000000
# The linkage function uses the distance as a proximity metric and merges instances pairwise,
# creating larger clusters with every successive iteration. Using the ward.D2 linkage function:
agg_tree_ward<-hclust(d=heartdata_dist,method="ward.D2")
print(agg_tree_ward)
##
## Call:
## hclust(d = heartdata_dist, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 303
# Visualizing the Dendrogram
fviz_dend(agg_tree_ward,cex=.5)
# Cutting the tree to create 2 clusters and visualizing it.
agg_tree_warddend<-fviz_dend(agg_tree_ward,cex=.5,k=2,palette = "jco")
agg_tree_warddend
# To assess the partition accuracy of the cluster tree (created by hclust()), there should be a strong
# correlation between the original distance matrix and the linkage distances between objects, defined
# as cophenetic distances.
# Calculating Cophenetic Distances
agg_cophenetic<-cophenetic(agg_tree_ward)
head(agg_cophenetic)
## [1] 23.098528 23.098528 7.989358 14.023530 3.844826 14.023530
# Calculating the correlation between Cophenetic distances and original distances for :
cor(heartdata_dist,agg_cophenetic)
## [1] 0.4031523
# using average linkage function
agg_tree_average<-hclust(d=heartdata_dist,method="average")
fviz_dend(agg_tree_average,cex=.5)
# Cophenetic Distances
agg_cophenetic<-cophenetic(agg_tree_average)
# correlation between Cophenetic distances and original distances:
cor(heartdata_dist,agg_cophenetic)
## [1] 0.613122
# cut the Tree into clusters
two_groups<-cutree(agg_tree_ward,k=2)
table(two_groups)
## two_groups
## 1 2
## 185 118
head(two_groups,n=4)
## [1] 1 2 2 1
fviz_dend(agg_tree_average,k=3,cex=.5,color_labels_by_k =TRUE,rect=TRUE)
fviz_cluster(list(data=heartdata_scaled,cluster=two_groups))
two_groups<-cutree(agg_tree_average,h=2)
table(two_groups)
## two_groups
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 83 53 37 5 17 39 4 9 11 8 1 4 18 5 1 3 2 1 1 1
head(two_groups,n=4)
## [1] 1 2 2 1
fviz_dend(agg_tree_ward,k=2,cex=.5,color_labels_by_k =TRUE,rect=TRUE)
fviz_cluster(list(data=heartdata_scaled,cluster=two_groups))
# The Cluster package also provides Agglomerative and Divisive methodology
#Agglomerative
agnes_cluster<-agnes(x=heartdata_scaled,stand=TRUE,metric = "euclidean",method="ward")
agnes_cluster$ac
## [1] 0.9707085
agnes_tree<-pltree(agnes_cluster, cex = 0.6, hang = -1, main = "Dendrogram of Agnes")
print(agnes_tree)
## NULL
# plot.hclust()
plot(as.hclust(agnes_cluster), cex = 0.6, hang = -1)
# Divisive
diana_cluster<-diana(x=heartdata_scaled,stand=TRUE,metric = "euclidean")
fviz_dend(agnes_cluster,cex=.6,k=2)
fviz_dend(diana_cluster,cex=.6,k=2)
# Heatmaps are used for visualizing hierarchical clustering of both samples and features.
# With the default palette, high values are in red and low values in yellow.
heatmap(heartdata_scaled)
library(gplots)
## Warning: package 'gplots' was built under R version 3.6.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
heatmap.2(heartdata_scaled,scale="none",col=bluered(100),trace = "none",density.info = "none")
# Visually Appealing
library(pheatmap)
## Warning: package 'pheatmap' was built under R version 3.6.3
pheatmap(heartdata_scaled, cutree_rows = 2)
# Interactive Heatmap
library(d3heatmap)
## Warning: package 'd3heatmap' was built under R version 3.6.3
d3heatmap(scale(heartdata),k_row=4,k_col=2)
library(clustertend)
library(factoextra)
heartdata_initial<-read_excel("heart.xlsx")
heartdata_initial$Sex<-factor(heartdata_initial$Sex)
heartdata_initial$ChestPainType<-as.factor(heartdata_initial$ChestPainType)
heartdata_initial$FastingBPmorethan120<-as.factor(heartdata_initial$FastingBPmorethan120)
heartdata_initial$ExerciseInducedAngina<-as.factor(heartdata_initial$ExerciseInducedAngina)
heartdata_initial$Thal<-as.factor(heartdata_initial$Thal)
#heartdata_initial$ColoredVessels<-as.factor(heartdata_initial$ColoredVessels)
heartdata_initial$HeartAttackRisk<-as.factor(heartdata_initial$HeartAttackRisk)
heartdata_initial$RestingElectrographicResults<-as.factor(heartdata_initial$RestingElectrographicResults)
# Summarizing the data
summary(heartdata_initial)
## Age Sex ChestPainType RestingBloodPressure Cholesterol
## Min. :29.00 0: 96 0:143 Min. : 94.0 Min. :126.0
## 1st Qu.:47.50 1:207 1: 50 1st Qu.:120.0 1st Qu.:211.0
## Median :55.00 2: 87 Median :130.0 Median :240.0
## Mean :54.37 3: 23 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :200.0 Max. :564.0
## FastingBPmorethan120 RestingElectrographicResults MaximumHeartRate
## 0:258 0:147 Min. : 71.0
## 1: 45 1:152 1st Qu.:133.5
## 2: 4 Median :153.0
## Mean :149.6
## 3rd Qu.:166.0
## Max. :202.0
## ExerciseInducedAngina Oldpeak Slope ColoredVessels Thal
## 0:204 Min. :0.00 Min. :0.000 Min. :0.0000 0: 2
## 1: 99 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000 1: 18
## Median :0.80 Median :1.000 Median :0.0000 2:166
## Mean :1.04 Mean :1.399 Mean :0.7294 3:117
## 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :6.20 Max. :2.000 Max. :4.0000
## HeartAttackRisk
## 0:138
## 1:165
##
##
##
##
# Scaling the data
heartdata<-heartdata_initial[c(1,4,5,8)]
summary(heartdata)
## Age RestingBloodPressure Cholesterol MaximumHeartRate
## Min. :29.00 Min. : 94.0 Min. :126.0 Min. : 71.0
## 1st Qu.:47.50 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:133.5
## Median :55.00 Median :130.0 Median :240.0 Median :153.0
## Mean :54.37 Mean :131.6 Mean :246.3 Mean :149.6
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5 3rd Qu.:166.0
## Max. :77.00 Max. :200.0 Max. :564.0 Max. :202.0
heartdata_scaled<-scale(heartdata)
summary(heartdata_scaled)
## Age RestingBloodPressure Cholesterol MaximumHeartRate
## Min. :-2.79300 Min. :-2.14525 Min. :-2.3203 Min. :-3.4336
## 1st Qu.:-0.75603 1st Qu.:-0.66277 1st Qu.:-0.6804 1st Qu.:-0.7049
## Median : 0.06977 Median :-0.09259 Median :-0.1209 Median : 0.1464
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.73041 3rd Qu.: 0.47760 3rd Qu.: 0.5448 3rd Qu.: 0.7139
## Max. : 2.49212 Max. : 3.89872 Max. : 6.1303 Max. : 2.2856
# By visualization, the data do appear to have 2 main clusters:
fviz_pca_ind(prcomp(heartdata_scaled),title="Heart Attack Risk Data",habillage =heartdata_initial$HeartAttackRisk,palette = "jco",geom = "point",ggtheme=theme_classic(),legend="bottom" )
kmeans_clust<-kmeans(heartdata_scaled,2)
fviz_cluster(list(data=heartdata_scaled,cluster=kmeans_clust$cluster),ellipse.type = "norm",geom="point",stand=FALSE,palette="jco",ggtheme = theme_classic())
# Calculating the Hopkins statistic, which shows that the data do exhibit inherent clustering tendency.
hopkins(heartdata_scaled,n=nrow(heartdata_scaled)-1)
## $H
## [1] 0.2277675
# Visualizing the dissimilarity matrix, where red depicts high similarity and blue low similarity
fviz_dist(dist(heartdata_scaled),show_labels = FALSE)+labs(title = "Heart Risk Data Set")
# Comparing Clustering Algorithms
library(clValid)
## Warning: package 'clValid' was built under R version 3.6.3
library(clustertend)
library(factoextra)
heartdata_initial<-read_excel("heart.xlsx")
heartdata_initial$Sex<-factor(heartdata_initial$Sex)
heartdata_initial$ChestPainType<-as.factor(heartdata_initial$ChestPainType)
heartdata_initial$FastingBPmorethan120<-as.factor(heartdata_initial$FastingBPmorethan120)
heartdata_initial$ExerciseInducedAngina<-as.factor(heartdata_initial$ExerciseInducedAngina)
heartdata_initial$Thal<-as.factor(heartdata_initial$Thal)
#heartdata_initial$ColoredVessels<-as.factor(heartdata_initial$ColoredVessels)
heartdata_initial$HeartAttackRisk<-as.factor(heartdata_initial$HeartAttackRisk)
heartdata_initial$RestingElectrographicResults<-as.factor(heartdata_initial$RestingElectrographicResults)
# Summarizing the data
summary(heartdata_initial)
## Age Sex ChestPainType RestingBloodPressure Cholesterol
## Min. :29.00 0: 96 0:143 Min. : 94.0 Min. :126.0
## 1st Qu.:47.50 1:207 1: 50 1st Qu.:120.0 1st Qu.:211.0
## Median :55.00 2: 87 Median :130.0 Median :240.0
## Mean :54.37 3: 23 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :200.0 Max. :564.0
## FastingBPmorethan120 RestingElectrographicResults MaximumHeartRate
## 0:258 0:147 Min. : 71.0
## 1: 45 1:152 1st Qu.:133.5
## 2: 4 Median :153.0
## Mean :149.6
## 3rd Qu.:166.0
## Max. :202.0
## ExerciseInducedAngina Oldpeak Slope ColoredVessels Thal
## 0:204 Min. :0.00 Min. :0.000 Min. :0.0000 0: 2
## 1: 99 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000 1: 18
## Median :0.80 Median :1.000 Median :0.0000 2:166
## Mean :1.04 Mean :1.399 Mean :0.7294 3:117
## 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :6.20 Max. :2.000 Max. :4.0000
## HeartAttackRisk
## 0:138
## 1:165
##
##
##
##
# Scaling the data
heartdata<-heartdata_initial[c(1,4,5,8)]
summary(heartdata)
## Age RestingBloodPressure Cholesterol MaximumHeartRate
## Min. :29.00 Min. : 94.0 Min. :126.0 Min. : 71.0
## 1st Qu.:47.50 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:133.5
## Median :55.00 Median :130.0 Median :240.0 Median :153.0
## Mean :54.37 Mean :131.6 Mean :246.3 Mean :149.6
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5 3rd Qu.:166.0
## Max. :77.00 Max. :200.0 Max. :564.0 Max. :202.0
heartdata_scaled<-scale(heartdata)
summary(heartdata_scaled)
## Age RestingBloodPressure Cholesterol MaximumHeartRate
## Min. :-2.79300 Min. :-2.14525 Min. :-2.3203 Min. :-3.4336
## 1st Qu.:-0.75603 1st Qu.:-0.66277 1st Qu.:-0.6804 1st Qu.:-0.7049
## Median : 0.06977 Median :-0.09259 Median :-0.1209 Median : 0.1464
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.73041 3rd Qu.: 0.47760 3rd Qu.: 0.5448 3rd Qu.: 0.7139
## Max. : 2.49212 Max. : 3.89872 Max. : 6.1303 Max. : 2.2856
cluster_method<-c("hierarchical","kmeans","pam")
check<-clValid(heartdata_scaled,nClust=2:6,clMethods=cluster_method,validation="internal")
## Warning in clValid(heartdata_scaled, nClust = 2:6, clMethods = cluster_method, :
## rownames for data not specified, using 1:nrow(data)
summary(check)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical Connectivity 6.9282 9.0115 28.5460 80.0008 81.8048
## Dunn 0.1543 0.1543 0.1072 0.0920 0.0920
## Silhouette 0.3900 0.2925 0.2133 0.1918 0.1262
## kmeans Connectivity 80.7409 98.0278 119.8087 162.1262 168.1238
## Dunn 0.0260 0.0441 0.0598 0.0579 0.0699
## Silhouette 0.2386 0.2394 0.2379 0.1949 0.2077
## pam Connectivity 68.1496 115.7476 160.5782 163.4405 188.6448
## Dunn 0.0313 0.0313 0.0218 0.0546 0.0463
## Silhouette 0.2246 0.2086 0.1929 0.1769 0.1814
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 6.9282 hierarchical 2
## Dunn 0.1543 hierarchical 2
## Silhouette 0.3900 hierarchical 2
cluster_method<-c("hierarchical","kmeans","pam")
check_stability<-clValid(heartdata_scaled,nClust=2:6,clMethods=cluster_method,validation="stability")
## Warning in clValid(heartdata_scaled, nClust = 2:6, clMethods = cluster_method, :
## rownames for data not specified, using 1:nrow(data)
optimalScores(check_stability)
## Score Method Clusters
## APN 0.02777261 hierarchical 2
## AD 2.04004031 kmeans 6
## ADM 0.11080351 hierarchical 2
## FOM 0.94389095 pam 6
Objective:
**Market Basket Analysis and Association Rules**
Measuring Rule Interest
\[Support(X)=\frac{Count(X)}{N}\]
N=number of transactions in the database
Count(X) is the number of transactions containing the item or itemset X.
If {peanut, jelly, butter} occurs in 100 out of 250 transactions, then the support is 100/250 = 0.4.
A rule’s Confidence measures its predictive power
\[Confidence(X\rightarrow Y)=\frac{support(X,Y)}{support(X)}\]
The numerator is the support of both X and Y together, divided by the support of X.
This intuitively informs us that the presence of item X results in the presence of item Y.
Confidence is not symmetric: the rule does not imply the reverse direction, which is usually not true.
Support(X, Y) is analogous to the joint probability P(X and Y), Support(X) is analogous to the unconditional or marginal probability P(X), and Confidence(X → Y) is analogous to the conditional probability P(Y|X).
Strong rules are those that have both high support and high confidence.
Support {get well → flowers} = 3/5 = 60%, which is strong support.
Confidence {get well → flowers} = 0.6/0.6 = 100%
The association rule {get well → flowers} is therefore a strong rule.
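The get-well/flowers numbers work out as follows in R (a toy 5-transaction table; the indicator values are assumptions chosen to reproduce the 60%/100% figures above):

```r
# 5 hypothetical transactions; TRUE = item present
getwell <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
flowers <- c(TRUE, TRUE, TRUE, FALSE, TRUE)
N <- length(getwell)
support_xy <- sum(getwell & flowers) / N   # 3/5 = 0.6
support_x  <- sum(getwell) / N             # 3/5 = 0.6
confidence <- support_xy / support_x       # 0.6 / 0.6 = 1
c(support = support_xy, confidence = confidence)
```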